1. Introduction and summary

In principle, it is our purpose to study learning in a neural net as it occurs in
nature. The theory of recurrent neural nets [?] provides us with a model of
content-addressable memory as it might be realized, to some extent, in the brain. Learning,
in such a model, corresponds to adjusting the synaptic matrix $w_{ij}$ in such a way that
p memorized patterns $\xi^\mu$ ($\mu = 1, \ldots, p$) become fixed points of the neuron state
dynamics. This can be achieved in a recurrent neural net by sequentially clamping its
neurons to a well-defined and unique set of patterns, and adjusting the weights of the
connections according to some Hebbian learning rule. However, in reality, a neural
net cannot be clamped to a fixed set of ideal patterns. A more realistic assumption
would be that the clamping of the net to a pattern is always more or less distorted.
Consider, for instance, the visual system as a system in which the clamping is imposed 
by input from the retina. Since neurons are noisy objects, which once in a while fire 
spontaneously, an internal representation of a stimulus in the brain will hardly ever 
be identical to the representation corresponding to a previous stimulus. 

We therefore introduce noise to the set of patterns, thus making the set less
well-defined and less unique. A network state array of a net of N neurons is denoted
by

$$ x := (x_1, \ldots, x_N) \tag{1} $$

where $x_i = 1$ if neuron i is active and $x_i = 0$ if it is non-active. At every learning
step n, x will be similar to one of the p given patterns $\xi^1, \ldots, \xi^p$, but it has a nonzero
probability, for each neuron i (i = 1, ..., N), of deviating from it. At each learning
step n, the synaptic connections $w_{ij}$ adapt themselves according to a Hebbian
learning rule which is a function of the weights $w_{ik}$ and of the (binary) neuron states
$x_k$ (k = 1, ..., N). For the class of learning rules that we will use in our model, the
case of noiseless learning has been studied in detail [?]. At every learning
step n a pattern $\xi^\mu$ ($\mu = 1, \ldots, p$) is chosen. If, for each n, we put $x = \xi^\mu$, the resulting
weights $w_{ij}$ for $n \to \infty$ are known to coincide with the pseudo-inverse solution. The
pseudo-inverse solution is a particular solution of the under-determined set of pN
equations

$$ \sum_{j=1}^{N} w_{ij} \xi_j^\mu - \theta_i = \kappa (2\xi_i^\mu - 1) \tag{2} $$

for the N(N - 1) unknowns $w_{ij}$ ($i, j = 1, \ldots, N$; $i \neq j$), p < N. Here, $\kappa$ is a positive
number, and $\theta_i$ a constant. It is easily verified that these equations guarantee that
the so-called stability coefficients

$$ \gamma_i(\xi^\mu) = (2\xi_i^\mu - 1) \Big( \sum_{j=1}^{N} w_{ij} \xi_j^\mu - \theta_i \Big) \tag{3} $$

are positive for all patterns $\xi^\mu$. In our description, however, the network state
during learning is determined by a probability distribution $p^\mu(x)$, centered around
the patterns $\xi^\mu$. By means of the Master equation derived in section 2, we will arrive
at the following equation for the expectation value of the weights in the limit $n \to \infty$:

$$ \frac{1}{p} \sum_{\mu=1}^{p} \sum_{x \in \Omega} p^\mu(x) \Big[ \kappa (2x_i - 1) - \Big( \sum_{k=1}^{N} \langle w_{ik} \rangle_\infty x_k - \theta_i \Big) \Big] x_j = 0 \tag{4} $$

where $\Omega$ is the collection of all $2^N$ possible arrays x. In contrast to the equations (2),
this is a completely determined set of equations, the solution of which is essentially
different from the pseudo-inverse solution of (2). It will turn out to exist only if the
variance of the probability distribution $p^\mu(x)$ is non-zero, i.e., in the presence of noise.
This is the subject of section 3.
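
The statement that equations (2) guarantee positive stability coefficients (3) can be checked in one line. The step below only substitutes (2) into (3) and uses that $2\xi_i^\mu - 1 = \pm 1$ for binary patterns:

```latex
\gamma_i(\xi^\mu)
  = (2\xi_i^\mu - 1)\Big(\sum_{j=1}^{N} w_{ij}\,\xi_j^\mu - \theta_i\Big)
  = (2\xi_i^\mu - 1)\,\kappa\,(2\xi_i^\mu - 1)
  = \kappa > 0 .
```

In other words, any solution of (2) stores every pattern with the same margin $\kappa$.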

In section 4 we will study numerically the weights $\langle w_{ij} \rangle_n$ as a function of n.

In section 5 we will show that, on the average, this solution does yield stability
coefficients close to $\kappa$ if the 'noise parameter' b, which will be introduced in the
probability distribution $p^\mu(x)$, is small enough, which shows that the solution found
is indeed stable.

Finally, in section 6 we study the size of the basins of attraction around our new
solution. It is found that the sizes are larger than those around the pseudo-inverse
solution, a result that is in perfect agreement with earlier observations [?] that
learning with noise enlarges the basins of attraction. In these earlier studies, however,
no analytical expression for the average values of the weights has been given.



2. Derivation of the Master Equation for a linear learning rule 

We consider a recurrent net of N binary neurons. The strength of the synaptic
connection between the post-synaptic neuron i and the pre-synaptic neuron j will be
denoted by $w_{ij}$. The neurons i (i = 1, ..., N) can take the values $x_i = 0$ or $x_i = 1$,
corresponding to the non-active and active state, respectively. It is useful to associate
with each neuron i a set $V_i$, defined as the collection of neuron indices j with which
neuron i has an adaptable, i.e., a non-zero, non-constant afferent synaptic connection.
In other words, for all $j \in V_i$, there is an axon going from neuron j to a dendrite of
neuron i, and the corresponding weight $w_{ij}$ is adaptable in a learning process. The
collection of neurons j with which i has no connection, or a non-changing synaptic
connection, will be denoted as the complementary set, $V_i^c$. We suppose that the
synaptic strengths $w_{ij}$ between the neurons are changed in steps according to a rule
of the general form

$$ w'_{ij} = w_{ij} + \Delta w_{ij}, \qquad j \in V_i \tag{5} $$

$$ w'_{ij} = w_{ij}, \qquad j \in V_i^c \tag{6} $$

where $\Delta w_{ij}$ is a function of the states $x_k$ of all N neurons of the net and all afferent
synaptic weights $w_{ik}$ (i fixed, k = 1, 2, ..., N). In general, the functions $\Delta w_{ij}$ will
be linear in all $x_k$, since $x_k^2 = x_k$ (recall that $x_k$ equals 0 or 1), but non-linear in the
weights $w_{ik}$. In this article we suppose, however, that the $\Delta w_{ij}$ do depend linearly on
the weights $w_{ik}$. Hence, in this article,

$$ \Delta w_{ij} = \Delta w_{ij}(x, w_i) \tag{7} $$

is a linear function in all $x_k$ and all $w_{ik}$ (k = 1, 2, ..., N). We abbreviated

$$ w_i := (w_{i1}, \ldots, w_{iN}) \tag{8} $$

It is unrealistic to describe a biological neural net as a deterministic system, since
there are many unknown parameters that influence its development in time. We
therefore choose a probabilistic description. We suppose that the neuron states $x_j$
are mutually independent stochastic variables, i.e., the probability that neuron i has
value $x_i$ is given by a probability distribution $p_i(x_i)$ which is independent of $x_j$ ($j \neq i$).
Since the changes $\Delta w_{ij}$ of the weights are functions of the stochastic variables $x_j$
(j = 1, 2, ..., N), and a function of a stochastic variable is a stochastic variable, the
changes $\Delta w_{ij}$, and, hence, the weights $w_{ij}$ themselves, are stochastic variables.

Now let $T_{ij}(w'_{ij} | w_{ij}, \{w_{ik}\}_{k \neq j})$ be the probability that, due to a learning step, a
transition takes place from the value $w_{ij}$ to the value $w'_{ij} = w_{ij} + \Delta w_{ij}$, for a given
set $\{w_{ik}\}_{k \neq j}$. Then we have

$$ T_{ij}(w'_{ij} | w_{ij}, \{w_{ik}\}_{k \neq j}) = \sum_{x \in \Omega} p(x) \, \delta\big(w'_{ij} - w_{ij} - \Delta w_{ij}(x, w_i)\big) \tag{9} $$

where $\Omega$ is the collection of all $2^N$ possible states of the neural net ($x_i = 0, 1$; i =
1, ..., N) and p(x) is the probability of occurrence of the network state x, which we
suppose to be independent of the variables $w_{ij}$. [The relation between p(x) and $p_i(x_i)$
is left unspecified at this stage of the reasoning; compare, however, (26) and (27)
below.] We have

$$ \sum_{x \in \Omega} p(x) = 1 \tag{10} $$

The delta-function in (9) guarantees that only transitions take place which obey the
learning rule (5). Using (10) we find from (9) that

$$ \int T_{ij}(w'_{ij} | w_{ij}, \{w_{ik}\}_{k \neq j}) \, dw'_{ij} = 1 \tag{11} $$

Let $P_{ij}(w_{ij}, n)$ be the probability of occurrence of the variable $w_{ij}$ at a time step n
(n = 0, 1, 2, ...). Then $P_{ij}$ and $T_{ij}$ are related according to

$$ P_{ij}(w_{ij}, n+1) = \int \cdots \int T_{ij}(w_{ij} | \{w'_{ik}\}) \prod_{k=1}^{N} \big[ P_{ik}(w'_{ik}, n) \, dw'_{ik} \big] \tag{12} $$

Demanding that the probability $P_{ij}$ is normalized initially,

$$ \int P_{ij}(w_{ij}, 0) \, dw_{ij} = 1 \tag{13} $$

we find from (12) and (11), by induction, that

$$ \int P_{ij}(w_{ij}, n) \, dw_{ij} = 1 \tag{14} $$

for all n. From (11) and (12) it follows that

$$ P_{ij}(w_{ij}, n+1) - P_{ij}(w_{ij}, n) = \int \cdots \int \big[ T_{ij}(w_{ij} | \{w'_{ik}\}) P_{ij}(w'_{ij}, n)
   - T_{ij}(w'_{ij} | w_{ij}, \{w'_{ik}\}_{k \neq j}) P_{ij}(w_{ij}, n) \big] \prod_{k \neq j} \big[ P_{ik}(w'_{ik}, n) \, dw'_{ik} \big] \, dw'_{ij}
   \qquad (i = 1, \ldots, N; \; j \in V_i) \tag{15} $$

which is the so-called Discrete Master Equation for the weights $w_{ij}$. It governs the
evolution of the weights as a function of n, and determines the values of the weights
in the long run.

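In order to obtain an expression for the expectation value of the weights after
infinitely many learning steps, we first consider the expectation value at time step n:

$$ \langle w_{ij} \rangle_n := \int P_{ij}(w_{ij}, n) \, w_{ij} \, dw_{ij} \qquad (j \in V_i) \tag{16} $$

The latter expression yields, using the Master Equation (15),

$$ \langle w_{ij} \rangle_{n+1} - \langle w_{ij} \rangle_n = \int \cdots \int w_{ij} \big[ T_{ij}(w_{ij} | \{w'_{ik}\}) P_{ij}(w'_{ij}, n)
   - T_{ij}(w'_{ij} | w_{ij}, \{w'_{ik}\}_{k \neq j}) P_{ij}(w_{ij}, n) \big] \prod_{k \neq j} \big[ P_{ik}(w'_{ik}, n) \, dw'_{ik} \big] \, dw'_{ij} \, dw_{ij} \tag{17} $$

or, interchanging the primed and unprimed variables $w_{ij}$ and $w'_{ij}$ in the first term on
the right hand side,

$$ \langle w_{ij} \rangle_{n+1} - \langle w_{ij} \rangle_n = \int \cdots \int (w'_{ij} - w_{ij}) \, T_{ij}(w'_{ij} | w_{ij}, \{w'_{ik}\}_{k \neq j}) \, P_{ij}(w_{ij}, n)
   \times \prod_{k \neq j} \big[ P_{ik}(w'_{ik}, n) \, dw'_{ik} \big] \, dw'_{ij} \, dw_{ij} \tag{18} $$

or, with (9) and integrating over $w'_{ij}$,

$$ \langle w_{ij} \rangle_{n+1} - \langle w_{ij} \rangle_n = \sum_{x \in \Omega} p(x) \int \cdots \int \Delta w_{ij}(x, w_i) \prod_{k=1}^{N} P_{ik}(w_{ik}, n) \, dw_{ik} \qquad (j \in V_i) \tag{19} $$

or, with (14) and (16),

$$ \langle w_{ij} \rangle_{n+1} - \langle w_{ij} \rangle_n = \sum_{x \in \Omega} p(x) \, \Delta w_{ij}(x, \langle w_i \rangle_n), \qquad (j \in V_i) \tag{20} $$

where we used that $\Delta w_{ij}$ is linear in the $w_{ik}$ (k = 1, ..., N) to replace $w_i$ by the
expectation value $\langle w_i \rangle_n$ in the expression for $\Delta w_{ij}$. If we assume that the expectation
values of the synaptic connections $\langle w_{ij} \rangle_n$ converge to finite values, $\langle w_{ij} \rangle_\infty$, for n
tending to infinity, we can solve this equation for $n \to \infty$. This is the subject of the
next section.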

3. Final values for the weights 

If we suppose that the left-hand side of (20) vanishes in the limit of n tending to
infinity, we have

$$ \sum_{x \in \Omega} p(x) \, \Delta w_{ij}(x, \langle w_i \rangle_\infty) = 0 \tag{21} $$

At this point, we need an expression for the increment $\Delta w_{ij}(n)$ in the n-th learning
step. We take the biologically motivated learning rule

$$ \Delta w_{ij}(n) = \eta_i \, [\kappa - \gamma_i(x, n)] \, (2x_i - 1) \, x_j \qquad (i = 1, \ldots, N; \; j \in V_i) \tag{22} $$

where $\eta_i$ is the learning rate, $\kappa$ the margin parameter and $\gamma_i(x, n)$ the stability
coefficient given by

$$ \gamma_i(x, n) = (2x_i - 1) \, [h_i(x, n) - \theta_i] \tag{23} $$

[cf. eq. (3)]. Here, $h_i(x, n)$ is the membrane potential of neuron i at step n of the
learning process

$$ h_i(x, n) = \sum_{k=1}^{N} w_{ik}(n) \, x_k \tag{24} $$

and $\theta_i$ the threshold potential of neuron i. It should be noted that in (24) $x_k$ is the
state of neuron k at step n of the learning procedure. Substituting (22) with (23) and
(24) into (21) we find, using the fact that $(2x_i - 1)^2 = 1$,

$$ \sum_{x \in \Omega} p(x) \Big[ \kappa (2x_i - 1) - \Big( \sum_{k=1}^{N} \langle w_{ik} \rangle_\infty x_k - \theta_i \Big) \Big] x_j = 0 \tag{25} $$

where we divided by the learning rate $\eta_i$.
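
The learning rule (22) with (23) and (24) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `learning_step`, the list-based representation of the weights, and the argument `adaptable` (standing for the index set $V_i$) are hypothetical and not part of the original text.

```python
def learning_step(w_i, x, i, theta_i, kappa, eta_i, adaptable):
    """Return the afferent weights of neuron i after one step of rule (22).

    w_i       -- list of weights w_ik, k = 0..N-1 (illustrative representation)
    x         -- list of binary neuron states (0 or 1)
    adaptable -- indices j in V_i; all other weights are left unchanged, eq. (6)
    """
    # Membrane potential, eq. (24): h_i = sum_k w_ik x_k
    h_i = sum(w_ik * x_k for w_ik, x_k in zip(w_i, x))
    # Stability coefficient, eq. (23): gamma_i = (2 x_i - 1)(h_i - theta_i)
    gamma_i = (2 * x[i] - 1) * (h_i - theta_i)
    new_w = list(w_i)
    for j in adaptable:
        # Increment, eq. (22): eta_i [kappa - gamma_i] (2 x_i - 1) x_j
        new_w[j] = w_i[j] + eta_i * (kappa - gamma_i) * (2 * x[i] - 1) * x[j]
    return new_w
```

Note that a weight $w_{ij}$ only changes when the pre-synaptic neuron j is active ($x_j = 1$), and that the update vanishes once $\gamma_i$ has reached the margin $\kappa$.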

Up to now, the precise form of the probability distribution p(x) has been left
unspecified. At this point, let us specify our probability distribution p(x) to be such
that the chosen patterns x are centered around representative patterns $\xi^\mu$. To that
end, we choose our probability distribution p(x) such that it is a sum of p equally
probable, individually independent probability distributions, i.e.,

$$ p(x) = \frac{1}{p} \sum_{\mu=1}^{p} p^\mu(x) \tag{26} $$

where $p^\mu(x)$ is factorizable,

$$ p^\mu(x) = \prod_{i=1}^{N} p_i^\mu(x_i) \tag{27} $$

i.e., the neurons behave independently from one another. The quantity $p_i^\mu(x_i)$ is the
probability that, once the pattern index $\mu$ is chosen, neuron i is in the state $x_i$. One
therefore has

$$ p_i^\mu(0) + p_i^\mu(1) = 1 \tag{28} $$

In a learning process, at every step n, the index $\mu$ is drawn from a collection of p equally
probable pattern indices, thus fixing the probability distribution $p^\mu(x)$ according to
which the pattern x is chosen for that learning step n.

Let us denote averages with respect to the probability $p^\mu(x)$ by a bar with index $\mu$:

$$ \sum_{x_i = 0, 1} p_i^\mu(x_i) \, x_i = \overline{x_i}^{\mu}, \qquad \sum_{x_i = 0, 1} p_i^\mu(x_i) \, x_i^2 = \overline{x_i^2}^{\mu} \tag{29} $$

implying, in view of (27) and (28),

$$ \sum_{x \in \Omega} p^\mu(x) \, x_i = \overline{x_i}^{\mu}, \qquad \sum_{x \in \Omega} p^\mu(x) \, x_i^2 = \overline{x_i^2}^{\mu} \tag{30} $$

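Thus a bar with an index $\mu$ indicates an average with respect to the probability
distribution $p^\mu(x)$. With the choice (26), the result (25) can be rewritten in terms of
these averages, where we must be aware that a term $\overline{x_j^2}^{\mu}$ appears in the sum
over k:

$$ \langle w_{ij} \rangle_\infty \sum_{\mu=1}^{p} \Big[ \overline{x_j^2}^{\mu} - \big( \overline{x_j}^{\mu} \big)^2 \Big]
   = \sum_{\mu=1}^{p} \Big[ \kappa \big( 2\overline{x_i}^{\mu} - 1 \big) - \Big( \sum_{k=1}^{N} \langle w_{ik} \rangle_\infty \overline{x_k}^{\mu} - \theta_i \Big) \Big] \overline{x_j}^{\mu} \tag{31} $$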
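
The step from the sum over all $2^N$ states in (25) to the averaged form (31) can be checked numerically for a small net. The sketch below is an illustration only: it assumes the bit-flip noise model that the text introduces later in (58), takes $j \neq i$, and uses hypothetical function names. It evaluates the left-hand side of (25) by brute-force enumeration and compares it with the closed form written in terms of the averages.

```python
from itertools import product

def mean_states(xi, b):
    # Assumed noise model (cf. (58)): each neuron copies the pattern bit
    # with probability 1 - b and takes the opposite value with probability b.
    return [(1 - b) * s + b * (1 - s) for s in xi]

def drift_by_enumeration(w_i, i, j, patterns, b, kappa, theta_i):
    """Left-hand side of (25), with p(x) from (26)-(27), summed over all 2^N states."""
    n = len(w_i)
    total = 0.0
    for xi in patterns:
        q = mean_states(xi, b)  # q[k] = probability that x_k = 1
        for x in product((0, 1), repeat=n):
            prob = 1.0
            for k in range(n):
                prob *= q[k] if x[k] == 1 else 1 - q[k]
            h = sum(w_i[k] * x[k] for k in range(n))
            total += prob * (kappa * (2 * x[i] - 1) - (h - theta_i)) * x[j]
    return total / len(patterns)

def drift_closed_form(w_i, i, j, patterns, b, kappa, theta_i):
    """The same quantity written with averages, i.e. (31) divided by p (j != i)."""
    total = 0.0
    for xi in patterns:
        q = mean_states(xi, b)
        h_bar = sum(w * m for w, m in zip(w_i, q))
        variance = q[j] - q[j] ** 2  # mean of x_j^2 equals mean of x_j for binary x_j
        total += (kappa * (2 * q[i] - 1) - (h_bar - theta_i)) * q[j] - variance * w_i[j]
    return total / len(patterns)
```

The extra term `variance * w_i[j]` is exactly the contribution of $\overline{x_j^2}^{\mu}$ that the text warns about; without it the two functions disagree whenever the noise is non-zero.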

The latter result can be rewritten as

$$ p \sigma_j^2 \langle w_{ij} \rangle_\infty = - \sum_{k \in V_i} (A_i)_{jk} \langle w_{ik} \rangle_\infty + B_{ij}, \qquad j \in V_i \tag{32} $$



Learning by a neural net in a noisy environment - The pseudo-inverse solution revisited! 



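where we abbreviated

$$ \sigma_j^2 := \frac{1}{p} \sum_{\mu=1}^{p} \Big[ \overline{x_j^2}^{\mu} - \big( \overline{x_j}^{\mu} \big)^2 \Big] \tag{33} $$

and where we split up the sum over all k in a sum over $V_i$ and a sum over its
complement:

$$ (A_i)_{jk} := \sum_{\mu=1}^{p} \overline{x_j}^{\mu} \, \overline{x_k}^{\mu}, \qquad i = 1, \ldots, N; \; j, k \in V_i \tag{34} $$

$$ B_{ij} := \sum_{\mu=1}^{p} \Big[ \kappa \big( 2\overline{x_i}^{\mu} - 1 \big) - \Big( \sum_{k \in V_i^c} \langle w_{ik} \rangle_0 \, \overline{x_k}^{\mu} - \theta_i \Big) \Big] \overline{x_j}^{\mu}, \qquad i = 1, \ldots, N; \; j \in V_i \tag{35} $$

Note that the matrix $A_i$ is a symmetric matrix, the dimension of which equals the
number of indices in $V_i$, i.e., the number of adaptable afferent synaptic connections
of neuron i. In the matrix B, we could write $\langle w_{ik} \rangle_0$ rather than $\langle w_{ik} \rangle_\infty$, since
$\langle w_{ik} \rangle_0 = \langle w_{ik} \rangle_\infty$ for $k \in V_i^c$. It is easy to solve the equation (32). First, rewrite
it as

$$ \sum_{k \in V_i} \big[ (D_i)_{jk} + (A_i)_{jk} \big] \langle w_{ik} \rangle_\infty = B_{ij} \tag{36} $$

where $D_i$ is the diagonal matrix

$$ (D_i)_{jk} := p \sigma_j^2 \, \delta_{jk}, \qquad j, k \in V_i \tag{37} $$

The matrix $D_i + A_i$ is non-singular, and can be inverted. Inserting the explicit form
of (35), we then find

$$ \langle w_{ij} \rangle_\infty = \sum_{k \in V_i} \big[ (D_i + A_i)^{-1} \big]_{jk} \sum_{\mu=1}^{p} \Big[ \kappa \big( 2\overline{x_i}^{\mu} - 1 \big) - \Big( \sum_{l \in V_i^c} \langle w_{il} \rangle_0 \, \overline{x_l}^{\mu} - \theta_i \Big) \Big] \overline{x_k}^{\mu}, \qquad j \in V_i \tag{38} $$

where we used that $D_i$ and $A_i$ are symmetric matrices. In the usual treatments of
noiseless recurrent neural networks ($\sigma_j = 0$ for all j), one finds for the $w_{ij}(\infty)$ the
so-called pseudo-inverse solution [?], which reads, in our notation,

$$ w_{ij}(\infty) = w_{ij}(0) + \sum_{\mu, \nu = 1}^{p} \big( C_i^{-1} \big)^{\mu\nu} \Big[ \kappa (2\xi_i^\mu - 1) - \Big( \sum_{k=1}^{N} w_{ik}(0) \, \xi_k^\mu - \theta_i \Big) \Big] \xi_j^\nu, \qquad j \in V_i \tag{39} $$

where $C_i^{-1}$ is the inverse of the correlation matrix $C_i^{\mu\nu} = \sum_{k \in V_i} \xi_k^\mu \xi_k^\nu$. Apparently, our
result (38) is not a simple generalization of the standard result for noiseless recurrent
neural networks. Note that the usual pseudo-inverse solution (39) depends on the
initial values $w_{ij}(0)$ of all the weights, whereas our solution (38) depends only on
$w_{ij}(0)$ for $j \in V_i^c$ and not on the initial value $w_{ij}(0)$ of the changing weights ($j \in V_i$).
Apparently, a little bit of noise completely wipes out the effect of the initial state of
changing connections, since the result (38) is true for any value of the noise unequal
to zero.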

In the limit that all $\sigma_j$ (33) vanish, the set of equations (36) becomes
under-determined for p < N, since the matrix $A_i$ is then singular. Hence, the solution (38)
does not exist for a noiseless net. Explicitly, this can be seen as follows. Let us suppose
that the p average patterns $\{\overline{x_1}^{\mu}, \overline{x_2}^{\mu}, \ldots, \overline{x_p}^{\mu}\}$, ($\mu = 1, \ldots, p$), span a p-dimensional
vector space. Then, for r > p there are coefficients $a_{rl}$ such that

$$ \overline{x_r}^{\mu} = \sum_{l=1}^{p} a_{rl} \, \overline{x_l}^{\mu} \tag{40} $$

for all $\mu = 1, \ldots, p$. It follows that every column $(A_i)_{jr}$ (i fixed, j a running index of
$V_i$ and r a fixed number larger than p) is a linear combination of the first p columns
$(A_i)_{js}$ (i fixed, j a running index of $V_i$ and s smaller than or equal to p). Consequently,
the matrix $A_i$ has a vanishing determinant, and is not invertible. Therefore, in case
the average squared deviation (33) would vanish, the unique solution (38) would not
exist. The fact that for vanishing variances $\sigma_j$ our set of equations for the final weights
is under-determined has been mentioned already in the introduction, in the text under
equation (4).
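
A small numerical illustration of this argument, under stated assumptions: a neuron with three adaptable inputs, p = 2 patterns, and the bit-flip noise model that the text introduces later in (58), for which $\sigma^2 = b(1 - b)$. The helper names `det3` and `d_plus_a` are hypothetical. Without noise the matrix $A_i$ of (34) is singular; with noise the matrix $D_i + A_i$ of (36) is not.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def d_plus_a(patterns, b):
    """Matrix D_i + A_i of (36) for three adaptable inputs and noise parameter b."""
    p = len(patterns)
    # Averaged neuron states under the assumed bit-flip noise.
    means = [[(1 - b) * s + b * (1 - s) for s in xi] for xi in patterns]
    sigma2 = b * (1 - b)
    return [[sum(m[j] * m[k] for m in means) + (p * sigma2 if j == k else 0.0)
             for k in range(3)]
            for j in range(3)]

patterns = [(1, 1, 0), (0, 1, 1)]            # p = 2 < 3 adaptable inputs
det_noiseless = det3(d_plus_a(patterns, 0.0))  # D_i = 0, so this is det(A_i)
det_noisy = det3(d_plus_a(patterns, 0.1))
```

For b = 0 the determinant vanishes because the three columns of $A_i$ are combinations of only two pattern vectors; for any b > 0 the diagonal term $p\sigma^2$ lifts all eigenvalues above zero.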

In [?] the occurrence of the average squared deviation (33) has been overlooked.
This enabled the authors to solve the Master Equation (20) in the usual way. By
means of the so-called Gauss-Seidel procedure they obtained a modified version of the
usual pseudo-inverse solution for the connections, rather than the expression (38).



4. Intermediate values for the weights 

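Since our approach was simply based on the assumption of convergence of the $\langle w_{ij} \rangle_n$
for $n \to \infty$, we had no knowledge of the intermediate values of the weights $\langle w_{ij} \rangle_n$ for
finite n. However, we can predict the evolution of the weights through an iterative
procedure. If we repeat the derivation in section 3, starting from (20) instead of (21),
we find

$$ \langle w_{ij} \rangle_{n+1} = \langle w_{ij} \rangle_n + \frac{\eta_i}{p} \sum_{\mu=1}^{p} \kappa \big( 2\overline{x_i}^{\mu} - 1 \big) \overline{x_j}^{\mu}
   - \frac{\eta_i}{p} \sum_{\mu=1}^{p} \Big( \sum_{k=1}^{N} \langle w_{ik} \rangle_n \, \overline{x_k}^{\mu} - \theta_i \Big) \overline{x_j}^{\mu}
   - \eta_i \sigma_j^2 \langle w_{ij} \rangle_n, \qquad j \in V_i \tag{41} $$

In the limit $n \to \infty$, equation (41) implies (31), provided that the weights
converge.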
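
The recursion (41) is easy to iterate numerically. The following is a minimal sketch, not the authors' program: it assumes the bit-flip noise model introduced later in (58), treats every neuron $j \neq i$ as adaptable, keeps $w_{ii}$ at its initial value, and uses the hypothetical function name `iterate_mean_weights`.

```python
def iterate_mean_weights(w0, i, patterns, b, kappa, theta_i, eta_i, steps):
    """Iterate recursion (41) for the expected afferent weights of neuron i.

    Assumptions of this sketch: bit-flip noise with parameter b, V_i contains
    all neurons j != i, and the entry w0[i] is never updated.
    """
    n, p = len(w0), len(patterns)
    means = [[(1 - b) * s + b * (1 - s) for s in xi] for xi in patterns]
    sigma2 = b * (1 - b)  # average squared deviation for this noise model
    w = list(w0)
    for _ in range(steps):
        # Averaged membrane potentials for every pattern, using the current weights.
        h_bar = [sum(w[k] * m[k] for k in range(n)) for m in means]
        new_w = list(w)
        for j in range(n):
            if j == i:
                continue
            drive = sum((kappa * (2 * m[i] - 1) - (h_bar[mu] - theta_i)) * m[j]
                        for mu, m in enumerate(means)) / p
            new_w[j] = w[j] + eta_i * (drive - sigma2 * w[j])
        w = new_w
    return w
```

Running this from two different starting vectors with b > 0 gives the same limit, whereas for b = 0 a weight whose pre-synaptic neuron is inactive in every pattern never moves away from its initial value, which illustrates the dependence on the initial state in the noiseless case.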



Using the relation (41), one can find, by numerical iteration, the quantities $\langle w_{ij} \rangle_n$
for any n, given the starting values $\langle w_{ij} \rangle_0$. Hence, we can verify numerically that
the $\langle w_{ij} \rangle_\infty$ are independent of these starting values. Moreover, one can study the
convergence of the learning procedure. In order to do so, one must make a particular
choice for the probability distribution $p^\mu(x)$, which, up to now, was left unspecified.
For our choice [see (58)], this distribution will depend on a so-called noise parameter b
($0 \le b < 1$), such that $\overline{x_j}^{\mu} = \xi_j^\mu$ and $\sigma_j^2 = 0$ if the noise parameter b vanishes (b = 0).
Through the parameter b we can tune the amount of noise during the learning process.
Numerical calculations show that the $\langle w_{ij} \rangle_n$ do indeed converge in the limit $n \to \infty$,
for arbitrary b, including b = 0, if $\eta_i$ is small enough. Interestingly, convergence times
to the final values (38) diverge for a decreasing noise parameter b (i.e., $b \to 0$), but
the time of convergence drops to a small value if b = 0 (see figure 1), indicating that
something peculiar happens in this limit.

In other words, if one demands existence of the solution (38), one may choose
b arbitrarily small, but not zero, and convergence to the solution is faster for larger
values of b. If one puts b to zero in the iterative application of (41), one observes rapid
convergence of the weights to the pseudo-inverse values (39). Maybe surprisingly,
these values have no continuous relation with the values for finite b, despite the fact
that the expressions (41), the difference equations that determine the weights $w_{ij}$, do
depend continuously on the $\sigma_j$, and, hence [see equation (61) below], on b. In view of
the difference in the solutions $\langle w_{ij} \rangle_\infty$ for the cases b = 0 [eq. (39)] versus $b \neq 0$ [eq.
(38)], this is obvious: there cannot be a continuous relationship between them, since
the pseudo-inverse solution (39) depends on all the initial values $w_{ij}(0)$, whereas our
solution (38) is independent of the initial values of the weights $w_{ij}(0)$ for $j \in V_i$.

Figure 1. Convergence time as a function of the training noise parameter. The
number of iterations M of the recursive formula (41) is plotted as a function
of the training noise parameter b. The number M is determined with the
help of the criterion that it is the smallest value of n for which the condition
$| \langle w_{ij} \rangle_n - \langle w_{ij} \rangle_\infty | < 0.01$ is satisfied for a fixed value of i. The network
consists of N = 128 neurons, the number of patterns is p = 16. Furthermore, we
chose $\theta_i = 0$ for all i and $\kappa = 1$. The learning rate is $\eta_i = 0.25$ for all i. The
smaller b, the more iterations are needed to obtain the exact final value $\langle w_{ij} \rangle_\infty$.
However, for b = 0, M drops to 50. In this case, of course, $\langle w_{ij} \rangle$ ...

In the next section we investigate whether our final solution for the weights
corresponds to the storage of patterns in a stable way.

5. Stability 

It is well-known that a neural net with fixed weights $w_{ij}$ (in our case this will be after
the learning phase) and deterministic neuron dynamics evolves, in the course of time,
to limit cycles of finite length n (n = 1, 2, ..., $2^N$). Cycles with n = 1, or fixed points,
are of particular interest in neural network theory. If a pattern $\xi^\mu$ is a fixed point of
the dynamics of a neural net for a given set of weights $w_{ij}$, the stability coefficients (3)
or (23) are positive for all i, in which case the system remains in the pattern $\xi^\mu$, i.e.,
the system is stable [?]. Besides the fact that $\gamma_i(\xi^\mu)$ is a measure for the size of the
basin of attraction of fixed point $\xi^\mu$ [?], it is plausible that it is also a measure that
determines to what extent the network state x remains in a neighborhood of $\xi^\mu$ when
the deterministic evolution of this neuron state x is replaced by a stochastic version
of this evolution. In order to get an idea of the effectiveness of the learning process
discussed in the preceding section, we will therefore consider the expectation value of
the stability coefficients (3) in the limit $n \to \infty$:

$$ \langle \gamma_i(\xi^\mu) \rangle_\infty = (2\xi_i^\mu - 1) \Big( \sum_{j=1}^{N} \langle w_{ij} \rangle_\infty \, \xi_j^\mu - \theta_i \Big) \tag{42} $$

Once a set of patterns is known, these quantities can be explicitly calculated with
the help of the expression (38). In this section we will attempt to derive, analytically,
an approximation of (42) by averaging over sets of patterns with given mean
activity a. If, however, we would calculate the average of $\langle w_{ij} \rangle_\infty$ over these patterns
directly, we would lose all dependence on neuron indices j and pattern indices $\mu$,
such that the correlations with the $\xi_j^\mu$ would disappear. We therefore have to take a
different route.

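An approximation for the expectation value (42) is

$$ \langle \gamma_i(\xi^\mu) \rangle_\infty \approx (2\xi_i^\mu - 1) \big( \langle \overline{h_i}^{\mu} \rangle_\infty - \theta_i \big) \tag{43} $$

where $\overline{h_i}^{\mu}$ is the membrane potential of neuron i averaged with respect to $p^\mu(x)$:

$$ \overline{h_i}^{\mu} = \sum_{j=1}^{N} w_{ij} \, \overline{x_j}^{\mu} \tag{44} $$

In fact, the approximation (43) would be exact if $\overline{x_j}^{\mu}$ were equal to $\xi_j^\mu$ for all j,
i.e., in the limit that the probability function is such that $\overline{x_j}^{\mu}$ equals $\xi_j^\mu$ for all j. The
average potential occurring in (43) can be found from (31). Indeed, multiplying by
$\overline{x_j}^{\mu}$ and summing with respect to $j \in V_i$, we find from this equation:

$$ \sum_{j \in V_i} \overline{x_j}^{\mu} \langle w_{ij} \rangle_\infty \, p \sigma_j^2 = \sum_{\nu=1}^{p} \Big[ \kappa \big( 2\overline{x_i}^{\nu} - 1 \big) - \Big( \sum_{k=1}^{N} \langle w_{ik} \rangle_\infty \overline{x_k}^{\nu} - \theta_i \Big) \Big] \sum_{j \in V_i} \overline{x_j}^{\nu} \, \overline{x_j}^{\mu} \tag{45} $$

where we also used (33).

The average square deviation $\sigma_j^2$ occurring in this equation depends on the neuron
j. In this article we will consider the case in which all neurons have the same standard
deviation $\sigma_j$,

$$ \sigma_j = \sigma, \qquad j = 1, \ldots, N \tag{46} $$

i.e., the probability $p_j(x_j)$ is supposed to be such that the uncertainty to find neuron
j in a state $\xi_j^\mu$ is the same for all neurons of the neural net. Using the identity

$$ \sum_{j \in V_i} \overline{x_j}^{\mu} \langle w_{ij} \rangle_\infty \, p \sigma^2 = \Big( \sum_{j=1}^{N} \overline{x_j}^{\mu} \langle w_{ij} \rangle_\infty - \sum_{j \in V_i^c} \overline{x_j}^{\mu} \langle w_{ij} \rangle_\infty \Big) p \sigma^2 \tag{47} $$

we find from (45)

$$ \Big( \langle \overline{h_i}^{\mu} \rangle_\infty - \sum_{j \in V_i^c} \langle w_{ij} \rangle_0 \, \overline{x_j}^{\mu} \Big) p \sigma^2 = \sum_{\nu=1}^{p} \Big[ \kappa \big( 2\overline{x_i}^{\nu} - 1 \big) + \theta_i - \langle \overline{h_i}^{\nu} \rangle_\infty \Big] \sum_{j \in V_i} \overline{x_j}^{\nu} \, \overline{x_j}^{\mu} \tag{48} $$

where we used $\langle w_{ij} \rangle_0 = \langle w_{ij} \rangle_\infty$ for $j \in V_i^c$. An alternative form for (48) reads

$$ \sum_{\nu=1}^{p} \big( p \sigma^2 \mathbb{1} + C_i \big)^{\mu\nu} \langle \overline{h_i}^{\nu} \rangle_\infty = \sum_{\nu=1}^{p} C_i^{\mu\nu} f_i^\nu + g_i^\mu \tag{49} $$

where $\mathbb{1}$ is the p x p unit matrix and where $C_i^{\mu\nu}$, the correlation matrix for averaged
neuron states, is defined by

$$ C_i^{\mu\nu} := \sum_{j \in V_i} \overline{x_j}^{\mu} \, \overline{x_j}^{\nu} \tag{50} $$

Furthermore, we abbreviated

$$ f_i^\mu := \kappa \big( 2\overline{x_i}^{\mu} - 1 \big) + \theta_i \tag{51} $$

$$ g_i^\mu := p \sigma^2 \sum_{j \in V_i^c} \langle w_{ij} \rangle_0 \, \overline{x_j}^{\mu} \tag{52} $$

Multiplying both sides of the matrix equation (49) by the inverse of the (symmetric)
matrix occurring on its left-hand side we obtain the solution

$$ \langle \overline{h_i}^{\mu} \rangle_\infty = \sum_{\nu, \lambda = 1}^{p} f_i^\nu \, C_i^{\nu\lambda} \big[ (p \sigma^2 \mathbb{1} + C_i)^{-1} \big]^{\lambda\mu} + \sum_{\nu=1}^{p} g_i^\nu \big[ (p \sigma^2 \mathbb{1} + C_i)^{-1} \big]^{\nu\mu} \tag{53} $$

Once a particular probability distribution $p^\mu(x)$ of patterns centered around $\xi^\mu$
is given, we can evaluate $f_i^\mu$ and $C_i^{\mu\nu}$, and, hence, via (53), the average stability
coefficient (43).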



In contrast to our expression (38) for the expectation value of the final value of the
weights $\langle w_{ij} \rangle_\infty$, the result for the expected average potential $\langle \overline{h_i}^{\mu} \rangle_\infty = \sum_j \langle w_{ij} \rangle_\infty \overline{x_j}^{\mu}$
does exist for vanishing $\sigma$. This is clear, already, from (53), in which the existence of
the inverse $(p \sigma^2 \mathbb{1} + C_i)^{-1}$ does not depend on the presence of the extra term $p \sigma^2 \mathbb{1}$ as
long as the average patterns $\overline{x}^{\mu}$ are linearly independent, since then $C_i$ is invertible.
Using (38) and (51), and assuming $\langle w_{ij} \rangle_0 = 0$ for all $j \in V_i^c$, we may write the average
potential $\langle \overline{h_i}^{\mu} \rangle_\infty$ as

$$ \langle \overline{h_i}^{\mu} \rangle_\infty = \sum_{\nu=1}^{p} \sum_{k, j \in V_i} \big[ (D_i + A_i)^{-1} \big]_{jk} \, \overline{x_k}^{\nu} \, \overline{x_j}^{\mu} \, f_i^\nu \tag{54} $$

Comparing this to (53) with (52), we obtain the identity

$$ \sum_{k, j \in V_i} \big[ (D_i + A_i)^{-1} \big]_{jk} \, \overline{x_k}^{\nu} \, \overline{x_j}^{\mu} = \sum_{\lambda=1}^{p} C_i^{\nu\lambda} \big[ (p \sigma^2 \mathbb{1} + C_i)^{-1} \big]^{\lambda\mu} \tag{55} $$

Hence, though the matrix $(D_i + A_i)^{-1}$ occurring in (54) itself does not exist for b = 0,
the above combination clearly does: it reduces to $\delta^{\mu\nu}$, as we see from the right hand
side for $\sigma = 0$, implying that in the limit of vanishing noise

$$ \langle \overline{h_i}^{\mu} \rangle_\infty = f_i^\mu \tag{56} $$

which is already clear from (53) and is equivalent to (the average of) eq. (2). Thus,
although the values of the weights themselves do not have a continuous relation with
the values corresponding to the pseudo-inverse solution, the average values for the
membrane potentials, and, therefore, of the stability coefficients, do.

In the following we suppose that $w_{ij} = 0$ for all $j \in V_i^c$. This corresponds to
a neural net in which all existing connections are of adaptable strength and the only
connections with constant strength are the non-existing connections. For $j \in V_i^c$, we
then have $w_{ij}(0) = 0$, and, hence, $\langle w_{ij} \rangle_0 = 0$, implying that

$$ g_i^\mu = 0 \tag{57} $$

Let us choose the probability distribution

$$ p_j^\mu(x_j) = (1 - b) \, \delta_{x_j, \xi_j^\mu} + b \, \delta_{x_j, 1 - \xi_j^\mu} \qquad (j = 1, \ldots, N) \tag{58} $$

which fulfills (28), and from which $p^\mu(x)$ follows by the prescription (27). The noise
parameter b is a probability ($0 \le b < 1$). More specifically, for given $\mu$ ($\mu = 1, 2, \ldots, p$),
1 - b is the probability that the activity $x_j$ of neuron j equals that of the pattern $\xi_j^\mu$. We
suppose that b is small compared to unity. As follows from (58), the noise parameter
b is related to the width of the distribution of input patterns around each pattern
$\xi^\mu$. We can immediately calculate the average neuron state (30) associated with the
distribution (58),

$$ \overline{x_j}^{\mu}(\xi_j^\mu) = (1 - b) \, \delta_{1, \xi_j^\mu} + b \, \delta_{1, 1 - \xi_j^\mu} \tag{59} $$

the coefficient (51),

$$ f_i^\mu = (2\xi_i^\mu - 1)(1 - 2b) \kappa + \theta_i \tag{60} $$

as well as the average squared deviation

$$ \sigma^2 = b (1 - b) \tag{61} $$

The fact that $\sigma^2$ is j-independent is a consequence of the particular choice (58) for
$p_j^\mu(x_j)$, i.e., of the fact that all neurons j are supposed to have the same uncertainty
to be in state $\xi_j^\mu$. Hence, the supposition (46) is satisfied.
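
The distribution (58) and the resulting moments (59) and (61) can be checked directly. The sketch below uses hypothetical helper names; it evaluates the mean and the variance of a single neuron state by summing over $x_j = 0, 1$, and it shows one possible way to draw a noisy network state, assuming that a bit-flip with probability b per neuron is what the sampling in a learning step amounts to.

```python
import random

def state_probability(x_j, xi_j, b):
    """Distribution (58): probability of state x_j given the pattern bit xi_j."""
    return (1 - b) if x_j == xi_j else b

def moments(xi_j, b):
    """Mean and variance of x_j under (58), by summing over x_j = 0, 1."""
    mean = sum(state_probability(x, xi_j, b) * x for x in (0, 1))
    second = sum(state_probability(x, xi_j, b) * x * x for x in (0, 1))
    return mean, second - mean ** 2

def sample_noisy_pattern(xi, b, rng):
    """Draw one network state x: each bit of xi is flipped with probability b."""
    return [1 - s if rng.random() < b else s for s in xi]
```

For example, `moments(1, 0.1)` gives a mean of 0.9 and `moments(0, 0.1)` a mean of 0.1, while the variance equals b(1 - b) = 0.09 in both cases, in line with (61).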

Let us suppose that the probability that $\xi_j^\mu = 1$ is a, for each j independent of
any other neuron index k, and, hence, that the probability that $\xi_j^\mu = 0$ is (1 - a), for
all of the patterns $\xi^1, \ldots, \xi^p$. We can now use this to arrive at an estimated value for
the average potential of neuron i, eq. (53), which is exact in the limit $p \to \infty$, for
$\alpha = p/N$ fixed, and smaller than 1. From (50) we find, for $\mu \neq \nu$,

$$ C_i^{\mu\nu} \approx \sum_{j \in V_i} \big\{ a^2 \, \overline{x_j}^{\mu}(1) \, \overline{x_j}^{\nu}(1) + a(1 - a) \, \overline{x_j}^{\mu}(1) \, \overline{x_j}^{\nu}(0)
   + (1 - a) a \, \overline{x_j}^{\mu}(0) \, \overline{x_j}^{\nu}(1) + (1 - a)^2 \, \overline{x_j}^{\mu}(0) \, \overline{x_j}^{\nu}(0) \big\} \tag{62} $$

while for $\mu = \nu$ we get

$$ C_i^{\mu\mu} \approx \sum_{j \in V_i} \big\{ a \, \overline{x_j}^{\mu}(1)^2 + (1 - a) \, \overline{x_j}^{\mu}(0)^2 \big\} \tag{63} $$

Defining the dilution d as the average fraction of neurons from which an arbitrary
neuron does not have an incoming connection, each neuron has on the average N(1 - d)
incoming connections. Hence, using (59), we find from (62) and (63)

$$ C_i^{\mu\nu} \approx N(1 - d) \big\{ a(1 - a)(1 - 2b)^2 \, \delta^{\mu\nu}
   + \big[ a^2 (1 - b)^2 + 2ab(1 - a)(1 - b) + (1 - a)^2 b^2 \big] \big\} \tag{64} $$

We thus have achieved that the correlation matrix $C_i^{\mu\nu}$ for an N neuron net has been
expressed in parameters typical for the network, namely the dilution d, the mean
activity a and the noise b. An alternative way to write (64) is

$$ C_i^{\mu\nu} \approx l \, \delta^{\mu\nu} + m \tag{65} $$

where l and m are shorthand for combinations of the typical network parameters a, b
and d

$$ l := N(1 - d) \, a(1 - a)(1 - 2b)^2 $$

$$ m := N(1 - d) \big[ a^2 (1 - b)^2 + 2ab(1 - a)(1 - b) + (1 - a)^2 b^2 \big] \tag{66} $$
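
As a quick consistency check of (64) to (66), the coefficients l and m can be evaluated for concrete network parameters. The function name `l_and_m` is hypothetical; the formulas are those of (66). Two identities follow from (59), (62) and (63) and are useful as sanity checks: m is N(1 - d) times the square of the mean averaged state, a(1 - b) + (1 - a)b, and l + m is N(1 - d) times the diagonal average $a(1 - b)^2 + (1 - a)b^2$.

```python
def l_and_m(n_neurons, d, a, b):
    """Coefficients l and m of (66) for dilution d, mean activity a and noise b."""
    scale = n_neurons * (1 - d)  # average number of incoming connections
    l = scale * a * (1 - a) * (1 - 2 * b) ** 2
    m = scale * (a ** 2 * (1 - b) ** 2
                 + 2 * a * b * (1 - a) * (1 - b)
                 + (1 - a) ** 2 * b ** 2)
    return l, m

# Parameters of the kind used in the figures (N = 128, d = 0.2, a = 0.5).
l_example, m_example = l_and_m(128, 0.2, 0.5, 0.1)
```

For b = 0 and a = 0.5 the two coefficients coincide, l = m = N(1 - d)/4, which is the familiar correlation structure of unbiased 0/1 patterns.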
With (65), the matrix occurring in (53) can be cast into the form

$$ \big( p \sigma^2 \mathbb{1} + C_i \big)^{\mu\nu} \approx (p \sigma^2 + l) \, \delta^{\mu\nu} + m \tag{67} $$

The inverse of a p-dimensional matrix A with elements $A^{\mu\nu} = x \, \delta^{\mu\nu} + y$ is given by
the matrix $A^{-1}$ with elements

$$ (A^{-1})^{\mu\nu} = \frac{\delta^{\mu\nu} (x + py) - y}{x (x + py)} \tag{68} $$
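
The inversion formula (68) is easily verified numerically by multiplying the matrix with its claimed inverse. The helper names below are hypothetical, and plain nested lists are used instead of a linear-algebra library to keep the sketch self-contained.

```python
def structured_matrix(x, y, p):
    """The p x p matrix with elements x * delta + y."""
    return [[(x if mu == nu else 0.0) + y for nu in range(p)] for mu in range(p)]

def structured_inverse(x, y, p):
    """Its inverse according to (68)."""
    denom = x * (x + p * y)
    return [[(((x + p * y) if mu == nu else 0.0) - y) / denom for nu in range(p)]
            for mu in range(p)]

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    size = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(size)) for c in range(size)]
            for r in range(size)]

identity_check = matmul(structured_matrix(2.0, 0.5, 4), structured_inverse(2.0, 0.5, 4))
```

The product `identity_check` is the 4 x 4 unit matrix up to rounding errors; the formula requires x and x + py to be non-zero.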

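From (65), and (68) applied to (67), we find

$$ \sum_{\lambda=1}^{p} C_i^{\mu\lambda} \big[ (p \sigma^2 \mathbb{1} + C_i)^{-1} \big]^{\lambda\nu} \approx \frac{l \, [l + p(\sigma^2 + m)] \, \delta^{\mu\nu} + m p \sigma^2}{(l + p \sigma^2) \, [l + p(\sigma^2 + m)]} \tag{69} $$

Substituting this result, together with (57), into the expression (53), yields for the
average potential of neuron i in pattern $\mu$ the expression

$$ \langle \overline{h_i}^{\mu} \rangle_\infty \approx \frac{\{ l \, [l + p(\sigma^2 + m)] + m p \sigma^2 \} \, f_i^\mu + m p \sigma^2 \sum_{\nu \neq \mu} f_i^\nu}{(l + p \sigma^2) \, [l + p(\sigma^2 + m)]} \tag{70} $$

The sum over the $f_i^\nu$ occurring in this expression can be calculated with (60),

$$ \sum_{\nu \neq \mu} f_i^\nu \approx (p - 1) \big[ (2a - 1)(1 - 2b) \kappa + \theta_i \big] \tag{71} $$

where we used that the average value of the p - 1 neuron activities $\xi_i^\nu$ can be
approximated by a, the average activity of the net. We can now write down the final
result for the stability parameters (43), which is a function of the network parameters
d, a and b, the number of patterns p, the number of neurons of the net N, and the
neuron properties $\kappa$ and $\theta_i$:

$$ \langle \gamma_i(\xi^\mu) \rangle_\infty \approx \frac{\{ l \, [l + p(\sigma^2 + m)] + m p \sigma^2 \} \, [(1 - 2b) \kappa + \theta_i (2\xi_i^\mu - 1)]}{(l + p \sigma^2) \, [l + p(\sigma^2 + m)]}
   + \frac{m p \sigma^2 (p - 1) \, [(2a - 1)(1 - 2b) \kappa + \theta_i] \, (2\xi_i^\mu - 1)}{(l + p \sigma^2) \, [l + p(\sigma^2 + m)]}
   - \theta_i (2\xi_i^\mu - 1) \tag{72} $$

Note that for $\sigma^2 = 0$ we immediately recover $\langle \overline{h_i}^{\mu} \rangle_\infty = f_i^\mu$ and $\langle \gamma_i(\xi^\mu) \rangle_\infty = \kappa$, as we
should, from the equations (70) and (72) respectively.

The final average stability coefficient of neuron i takes two different values,
depending on whether $\xi_i^\mu = 1$ or $\xi_i^\mu = 0$.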
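
The estimate for the average stability coefficient can be evaluated numerically by chaining (66), (61), (60), (71), (70) and (43). The sketch below is an illustration under stated assumptions: the function name `stability_estimate` is hypothetical, the noise model is (58), and the threshold is subtracted as prescribed by (43).

```python
def stability_estimate(n_neurons, p, d, a, b, kappa, theta_i, xi_i):
    """Approximate average stability coefficient of neuron i for pattern bit xi_i."""
    scale = n_neurons * (1 - d)
    l = scale * a * (1 - a) * (1 - 2 * b) ** 2                # eq. (66)
    m = scale * (a * (1 - b) + (1 - a) * b) ** 2              # eq. (66), factored form
    s = p * b * (1 - b)                                       # p * sigma^2, eq. (61)
    denom = (l + s) * (l + s + p * m)
    f_own = (2 * xi_i - 1) * (1 - 2 * b) * kappa + theta_i    # eq. (60)
    f_others = (p - 1) * ((2 * a - 1) * (1 - 2 * b) * kappa + theta_i)  # eq. (71)
    h_bar = ((l * (l + s + p * m) + m * s) * f_own + m * s * f_others) / denom  # eq. (70)
    return (2 * xi_i - 1) * (h_bar - theta_i)                 # eq. (43)
```

Two limits are worth checking with this function. For b = 0 it returns $\kappa$ for either value of $\xi_i^\mu$, and for a = 0.5 with $\theta_i = 0$ the values for $\xi_i^\mu = 1$ and $\xi_i^\mu = 0$ coincide, so that only a single curve appears as a function of b.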

In figure 2 we plotted this quantity for a chosen average activity a = 0.5, as
a function of b. It is clear that the stability coefficient can be expected to remain
positive. In the same figure we plotted the actual values of $\langle \gamma_i(\xi^\mu) \rangle_\infty$, as obtained by
choosing randomly a set of patterns with given mean activity a = 0.5, and using
(42) with (38) for the calculation.

The difference between the curves is evident, and indicates that we must be careful
not to overestimate the accuracy of our result as an indication for $\langle \gamma_i(\xi^\mu) \rangle_\infty$. In fact,
in a large region, the storage of undisturbed patterns is better than our estimate
suggests by a factor 2 to 3, as can be concluded from the figure. With this in mind,
we may assume that after the noisy learning process, the patterns $\xi^\mu$ are indeed fixed
points under the deterministic network dynamics for a small noise parameter b.



6. Retrieval and basins of attraction 



In this section we address the question what happens to the average size of the basins of
attraction if noiseless learning (training parameter b = 0) is compared to noisy learning
($b \neq 0$). After the network has been trained with patterns x with noise b ($b \ge 0$), we
numerically check the retrieval capacity of the net by presenting patterns with noise
b*, i.e., patterns chosen according to a probability distribution of the form (58), in
which b has been replaced by b*. The presented patterns evolve under deterministic
parallel dynamics $x_i(n + 1) = \Theta(h_i(n) - \theta_i)$. The attempt to retrieve a pattern is
successful if the network state x runs into a fixed point equal to the clean, undistorted
pattern $\xi^\mu$ of which a noisy version was the initial state. The result can be read off
from figure 3. Since the curves obtained via noisy learning lie above the curve with
noiseless learning, the basins of attraction are, apparently, enlarged in the presence of
noise during the learning stage.

Figure 2. Average stability as a function of the training noise parameter b. The
average, over all neurons i (i = 1, 2, ..., 256), and all patterns $\xi^\mu$ ($\mu = 1, \ldots, 32$),
of a (diluted) neural net (dilution d = 0.2), of the stability coefficients $\gamma_i(\xi^\mu)$,
with threshold potentials $\theta_i = 0$ for all i, as a function of the noise parameter
b. The curves with symbols (+) and (◊), for $\xi_i^\mu = +1$ or $\xi_i^\mu = 0$, respectively,
correspond to a numerical simulation, and should be compared to the curve with
symbol (□), obtained from the approximative expression (72). The approximation
itself is rather poor. However, it gives an indication, as the various curves show.
Note that the appearance of one single curve in the approximated case is due to
the choice a = 0.5 and $\theta_i = 0$.
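
The retrieval test described above can be sketched as follows. This is an illustration only, with hypothetical function names: it assumes that the step function $\Theta(u)$ equals 1 for u > 0 and 0 otherwise (the value at u = 0 is a choice made here), that the weight matrix `w` is given as nested lists, and that a fixed number of parallel update steps is applied before the final state is compared with the clean pattern through its overlap.

```python
def parallel_step(w, x, theta):
    """One step of x_i(n+1) = Theta(h_i(n) - theta_i), all neurons updated at once."""
    n = len(x)
    # Assumption: Theta(u) = 1 for u > 0 and 0 otherwise.
    return [1 if sum(w[i][k] * x[k] for k in range(n)) - theta[i] > 0 else 0
            for i in range(n)]

def overlap(xi, x):
    """Overlap N^-1 sum_i (2 xi_i - 1)(2 x_i - 1); it equals 1 only if x == xi."""
    return sum((2 * s - 1) * (2 * t - 1) for s, t in zip(xi, x)) / len(xi)

def retrieved(w, theta, xi, x0, max_steps=10):
    """True if, after max_steps parallel updates, x sits in the fixed point xi."""
    x = list(x0)
    for _ in range(max_steps):
        x = parallel_step(w, x, theta)
    return overlap(xi, x) == 1.0 and parallel_step(w, x, theta) == x
```

The fraction of retrieved patterns for a given b* is then the average of `retrieved(...)` over many noisy initial states `x0` drawn around each stored pattern.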

The result is in agreement with earlier studies by Gardner et al || , and Wong & 
Sherrington §,§,§. 

In [5], as in our case, noise is added to patterns during a training stage. However, 
the algorithm is of a different kind, because it includes an error mask, i.e., the weights 
are updated if and only if, upon presentation of a noisy pattern during the learning 
stage, the membrane potential h_i has the wrong sign. In this way, if the learning 
algorithm converges, retrieval of patterns for which the amount of noise is equal to 
that of the training patterns is guaranteed. 
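
To make the notion of an error mask concrete, the following Python sketch updates the incoming weights of a single neuron only when its membrane potential has the wrong sign for a presented noisy pattern. It is a generic perceptron-style illustration and not the algorithm of the cited work: the form of the update, the learning rate `eta` and the function name `masked_update` are assumptions made for this example.

```python
def masked_update(w_row, theta_i, x_noisy, target, eta=0.1):
    """Perceptron-style update with an error mask (illustrative assumption).

    w_row   : incoming weights w_ij of neuron i
    theta_i : threshold of neuron i
    x_noisy : noisy pattern presented to the net (entries 0 or 1)
    target  : desired state of neuron i in the clean pattern (0 or 1)
    Returns the (possibly unchanged) weights and a flag telling whether
    an update took place.
    """
    h = sum(wj * xj for wj, xj in zip(w_row, x_noisy)) - theta_i
    wrong_sign = (h > 0) != (target == 1)
    if not wrong_sign:
        # Error mask: the potential already has the correct sign, so no update.
        return list(w_row), False
    sign = 1 if target == 1 else -1
    return [wj + eta * sign * xj for wj, xj in zip(w_row, x_noisy)], True

# Usage example: the first presentation triggers an update, the second does not.
w_row, changed = masked_update([0.0, 0.0, 0.0], 0.0, [1, 0, 1], target=1)
w_row, changed_again = masked_update(w_row, 0.0, [1, 0, 1], target=1)
print(changed, changed_again)
```

By contrast, a learning rule without an error mask changes the weights at every learning step, whether or not the presented pattern is already stabilized.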

In [6], [7] and [8], various retrieval properties of a neural network are discussed. 
It is argued that optimizing the overlap of a noisy pattern with its corresponding 
training representative after one retrieval step, by finding the optimal weights, is 
in fact a form of noisy training. The optimal network is sought via a replica 
calculation that minimizes a cost function, thus optimizing the first-step retrieval. 
No explicit learning rule is used in these articles, and no explicit expression for the 
final values of the weights is given. 

Our approach is different from those discussed above, in the sense that we start 
from an explicit learning rule, which is biologically acceptable: it is derived from the 
principle that the energy cost for synaptic adaptation is minimal [4]; it is a function 
of local variables; it does not contain error masks; neurons are assumed to be noisy. 

Figure 3. Fraction of retrieved patterns for networks trained with various values 
of the training noise parameter b, as a function of the noise b* in the patterns to 
be retrieved. Shown is the average fraction of retrieved patterns for a network of 
N = 128 neurons with thresholds θ_i = 0. The values of the dilution and the activity 
are d = 0.2 and a = 0.5. The network has been trained with p = 32 patterns. 
Starting from an initial pattern with noise b*, the network evolves under parallel 
deterministic dynamics. The pattern ξ^μ is said to be retrieved if the overlap of the 
network state with this pattern, N^(-1) Σ_{i=1}^{N} (2ξ_i^μ - 1)(2x_i - 1), is equal 
to 1. At most 10 dynamical time steps are applied for every initial pattern. The 
curves (+), (×) and (*) correspond to b = 0, b = 0.05 and b = 0.1, respectively. 
Among the values of b chosen in this figure, the basins of attraction are largest 
for b = 0.1. The curve with b = 0 (noiseless training) lies below the curves 
corresponding to learning with noise. 



Though in our network the basins of attraction are not optimal (it was not our goal 
to optimize the basins of attraction), we do have an explicit learning algorithm as well 
as a final expression for the expectation value of weights. 

7. Conclusion 

We have shown that learning with noise leads to final values for the weights w_ij 
which are different from those found in the corresponding situation without noise. 
Surprisingly, the solution for the weights w_ij of a noisy system does not converge, 
in the limit of vanishing noise, to the solution of the system without noise. 

Moreover, in a system without noise the values of the final weights depend on 
the initial values of all weights, whereas in a noisy system the initial values of the 
changing weights, the w_ij for j ∈ V_i, are wiped out in the course of time. 

Networks trained with noise have larger basins of attraction than networks trained 
without noise. This is in agreement with earlier findings in the literature. The 
exact dependence of the retrieval properties on various parameters, such as the mean 
activity a and the memory load α = p/N, is still to be elucidated. 



Acknowledgment 

The authors are indebted to Wouter Kager for carefully reading this manuscript and 
suggesting some improvements. 



References 

[1] Müller B, Reinhardt J and Strickland M T 1995 Neural Networks: An Introduction (Berlin: Springer) 
[2] Personnaz L, Guyon I and Dreyfus G 1985 J. Physique Lett. 46 L359 
[3] Diederich S and Opper M 1987 Phys. Rev. Lett. 58 949 
[4] Heerema M and van Leeuwen W A 1999 J. Phys. A: Math. Gen. 32 263-86 
[5] Gardner E J, Stroud N and Wallace D J 1989 J. Phys. A: Math. Gen. 22 2019-30 
[6] Wong K Y M and Sherrington D 1990 J. Phys. A: Math. Gen. 23 L175-82 
[7] Wong K Y M and Sherrington D 1990 J. Phys. A: Math. Gen. 23 4659-72 
[8] Wong K Y M and Sherrington D 1993 Phys. Rev. E 47 4465-82 
[9] Heerema M and van Leeuwen W A 2000 J. Phys. A: Math. Gen. 33 1781-95 
[10] Kinzel W and Opper M 1991 Models of Neural Networks ed Domany E et al (Berlin: Springer) p 152 
[11] Gardner E 1988 J. Phys. A: Math. Gen. 21 257-70 



